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MINIMAX ESTIMATION WITH THRESHOLDING AND ITS 
APPLICATION TO WAVELET ANALYSIS 

By Harrison H. Zhou 1 and J. T. Gene Hwang 2 

Yale University and Cornell University 

Many statistical practices involve choosing between a full model 
and reduced models where some coefficients are reduced to zero. Data 
were used to select a model with estimated coefficients. Is it possible 
to do so and still come up with an estimator always better than 
the traditional estimator based on the full model? The James-Stein 
estimator is such an estimator, having a property called minimaxity. 
However, the estimator considers only one reduced model, namely 
the origin. Hence it reduces no coefficient estimator to zero or every 
coefficient estimator to zero. In many applications including wavelet 
analysis, what should be more desirable is to reduce to zero only the 
estimators smaller than a threshold, called thresholding in this paper. 
Is it possible to construct this kind of estimators which are minimax? 

In this paper, we construct such minimax estimators which per- 
form thresholding. We apply our recommended estimator to the wavelet 
analysis and show that it performs the best among the well-known 
estimators aiming simultaneously at estimation and model selection. 
Some of our estimators are also shown to be asymptotically optimal. 

1. Introduction. In virtually all statistical activities, one constructs a 
model to summarize the data. Not only could the model provide a good and 
effective way of summarizing the data, the model if correct often provides 
more accurate prediction. This point has been argued forcefully in Gauch 
(1993). Is there a way to use the data to select a reduced model so that 
if the reduced model is correct, the model-based estimator will improve on 
the naive estimator (constructed using a full model) and yet never do worse 
than the naive estimator even if the full model is actually the only correct 
model? James-Stein estimation (1961) provides such a striking result under 
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the normality assumption. Any estimator such as the James-Stein estimator 
that does no worse than the naive estimator is said to be minimax. See the 
precise discussion right before Lemma 1 of Section 2. The problem with 
the James-Stein positive part estimator is, however, that it selects only 
between two models: the origin and the full model. It is possible to construct 
estimators similar to the James-Stein positive part to select between the full 
model and another linear subspace. However, it always chooses between the 
two. The nice idea of George (1986a, b) in multiple shrinkage does allow the 
data to choose among several models; it, however, does not do thresholding, 
as is the aim of the paper. 

Models based on wavelets are very important in many statistical applica- 
tions. Using these models involves model selection among the full model or 
the models with smaller dimensions where some of the wavelet coefficients 
are zero. Is there a way to select a reduced model so that the estimator 
based on it does no worse in any case than the naive estimator based on the 
full model, but improves substantially upon the naive estimator when the 
reduced model is correct? Again, the James-Stein estimator provides such a 
solution. However, it selects either the origin or the full model. Furthermore, 
the ideal estimator should do thresholding; namely, it gives zero as an esti- 
mate for the components which are smaller than a threshold, and preserves 
(or shrinks) the other components. However, to the best knowledge of the 
authors, no such minimax estimators have been constructed. In this paper, 
we provide minimax estimators which perform thresholding simultaneously. 

Section 2 develops the new estimator for the canonical form of the model 
by solving Stein's differential inequality. Sections 3 and 4 provide an approx- 
imate Bayesian justification and an empirical Bayes interpretation. Section 
5 applies the result to wavelet analysis. The proposed method outperforms 
several prominent procedures in the statistical wavelet literature. Asymp- 
totic optimality of some of our estimators is established in Section 6. 

2. New estimators for a canonical model. In this section we shall con- 
sider the canonical form of the problem of a multinormal mean estimation 
problem under squared error loss. Hence we shall assume that our observa- 
tion 

z=(z 1 ,...,z d )~N(e,i) 

has a ci-dimensional normal distribution with mean 9 = (9\,... ,9^), and a 
known covariance identity matrix /. The case when the variance of Z{ is not 
known will be discussed briefly at the end of Section 5. 

The connection of this problem with wavelet analysis will be pointed out in 
Sections 5 and 6. In short, Zj and 9i represent the wavelet coefficients of the 
data and the true curve in the same resolution, respectively. Furthermore, 
d is the dimension of a resolution. For now, we shall seek an estimator of 
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9 based on Z. We shall, without loss of generality, consider an estimator of 
the form 5{Z) = (5i(Z), . . . , 5d(Z)), where, 

S i (Z) = Z i + g i {Z), 

where g(Z) : R d -> R, and search for g(Z) = (gi{Z), . . . , g d {Z)). To insure 
that the new estimator (perhaps with some thresholding) does better than 
Z (which does no thresholding), we shall compare the risk of S(Z) to the 
risk of Z with respect to the £2 norm. Namely, 

E\\8(Z)-9\\ 2 = Ej2(S i (Z)-9 i ) 2 . 

i=l 

It is obvious that the risk of Z is then d. We shall say one is as good as the 
other if the former has a risk no greater than the latter for every 9. Moreover, 
one dominates the other if it is as good as the other and has smaller risk for 
some 9. Also we shall say that an estimator strictly dominates the other if 
the former has a smaller risk for every 9. Note that Z is a minimax estimator, 
that is, it minimizes sup e E\5°(Z) — 9\ 2 among all 5°(Z). Consequently any 
S(Z) is as good as Z if and only if it is minimax. 

To construct an estimator that dominates Z, we use the following lemma. 

Lemma 1 [Stein (1981)]. Suppose that g : R d — ► R d is a measurable func- 
tion with gi(-) as the ith component. If for every i, gi{-) is almost differen- 
tiable with respect to the ith component and 



E 

then 



dzM z) 



< 00 for i = 1, . . . , d, 



E e \\Z + g(Z) - 9\\ 2 = E e {d + 2V • g(Z) + || 9 (^)|| 2 }, 



where V • g(Z) = Y^t=i ■ H^nce if g(Z) solves the differential inequality 



(1) 2V-g(Z) + \\g(Z)\\ 2 <0, 

the estimator Z + g(Z) strictly dominates Z. 

Remark. gi(z) is said to be almost differentiable with respect to z%, if 
for almost all Zj, j 7^ i, gi{z) can be written as a one-dimensional integral 
of a function with respect to Z{. For such Zj's, j 7^ i, gi(Z) is also called 
absolutely continuous with respect to z% in Berger (1980). 

To motivate the proposed estimator, note that the James-Stein positive 
estimator has the form 



3JS 

i 
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with c+ =max(c, 0) for any number c. This estimator, however, truncates 
independently of the magnitude of \Z{\. Indeed, it truncates all or none of the 
coordinates. To construct an estimator that truncates only the coordinates 
with small \Zi\, it seems necessary to replace d — 2 by a decreasing function 
of \Zi\ and consider 

where D, independent of i, is yet to be determined. [In a somewhat different 
approach, Beran and Diimbgen (1998) construct a modulation estimator 
corresponding to a monotonic shrinkage factor.] With such a form, Of = if 
h(\Zi\) > D, which has a better chance of being satisfied when \Z{\ is small. 

We consider the simple choice h(\Zi\) = a\Zi\~ 2 / 3 , and let D = £|Zj| 4 / 3 . 
This leads to the untruncated version 9 with the ith component 

(2) § i (Z) = Z i + g i (Z) where 9i (Z) = -aD' 1 sign(Z t )\Z i \ 1 / 3 . 

Here and later sign(Zj) denotes the sign of Z^. It is possible to use other 
decreasing functions and other D. 

In general, we consider, for a fixed < 2, an estimator of the form 

(3) § i = Z i +g i (Z), 
where 



sign^)^/" 1 



d 



(4) 9l (Z) = -a-^^— and D = ^\Z 



\0 



Although at first glance it may seem hard to justify this estimator, it has a 
Bayesian and empirical Bayes justification in Sections 3 and 4. It contains, 
as a special case with j3 = 2, the James-Stein estimator. Now we have: 

Theorem 2. For d>3 and 1 < (3 < 2, 0(Z) is minimax if and only if 

0<a<2G3-l)inf E o(D~ l Eti\Zif- 2 ) 

Proof. Obviously for Zj ^ 0, Vj 7^ i, gi{Z) can be written as the one- 
dimensional integral of 

-£-9i{Z) = /3(-a)(-l)D- 2 |^|^- 2 ) + {13- l){-a)D~\\Z^- 2 ) 
dZi 

with respect to Zj. (The only concern is at Zi = 0.) Consider only nonzero ZSs, 
j 7^ i. Since (3 > 1, this function, however, is integrable with respect to Zi 
even over an integral including zero. It takes some effort to prove that 
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E(\-rj^-gi(Z)\) is finite. However, one only needs to focus on Zj close to zero. 

Using the spherical- like transformation r 2 = ■> we may show that if 

d > 3 and f3 > 1, both terms in the above displayed expression are integrable. 
Now 

\\ g (Z)f=a 2 D- 2 J2\Z^- 2 . 

i=l 

Hence 

E e \\Z + g(Z) - 6\\ 2 < d for every 9 

if and only if 

E e {2V -g{Z) + || 5 (^)|| 2 } < for every 6, 

that is, 

E e ^(2P)D- 2 ^\Z t \W-V 
(5) -{2(l-2)D- l j2\Z l f- 2 \ +a 2 D~ 2 Y / \Z i \ 2 P- 2 \<0 

i=l J i=l ) 

for every 6, 

which is equivalent to the condition stated in the theorem. □ 

Theorem 3. The estimator 9(Z) with the ith component given in (3) 
and (4) is minimax provided < a < 2((3 — l)d — 2(5 and 1 < /3 < 2. Unless 
f3 = 2 and a is taken to the upper bound, 6{Z) dominates Z. 

Proof. By the correlation inequality 

^Ei^ 2 )<(Ei^r 2 )(Ei^i /3 )- 

Strict inequality holds almost surely if (3 < 2. Hence 

E e {D- 2 Y?i=i\Zi\W- 2 ) ~ (l/d)E e D-iJ2\Z l \P- 2 ■ 

Hence if < a < 2((3 — l)d — 2/3, the condition in Theorem 2 is satisfied, 
implying minimaxity of 0{Z). The rest of the statement of the theorem is 
now obvious. □ 



The following theorem is a generalization of Theorem 6.2 on page 302 
of Lehmann (1983) and Theorem 5.4 on page 356 of Lehmann and Casella 
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(1998). It shows that taking the positive part will improve the estimator 
componentwise. Specifically for an estimator (9\(Z), . . . ,9d(Z)) where 

e l (z) = (i-h i (z))z i , 

the positive part estimator of 0i(Z) is denoted as 

§f(Z) = (l-h i (Z)) + Z i . 

Theorem 4. Assume that hi{Z) is symmetric with respect to the ith 
coordinate. Then 

E e {e l -~ei) 2 <E e {e i -e i ) 2 . 

Furthermore, if 

(6) P e (hi(Z) >1)>0, 
then 

E e {e i -et?< E e{8 i -0i) 2 - 

Proof. Simple calculation shows that 

(7) E e (9i - ~ef? ~ Ee{0i - 0~i) 2 = E e {{9f) 2 - 9 2 ) - 29^(9+ - k)- 

Let us calculate the expectation by conditioning on hi{Z). For hi{Z) < 1, 
df = Oi. Hence it is sufficient to condition on hi(z) = b where b > 1 and show 
that 

E e {{9t) 2 - 9 2 MZ) = b) - 29iE e {9f - 9~i\hi{Z) = b) < 0, 
or equivalently, 

-E e (9 2 \hi(Z) = b) + 20<(1 - b)Eg(Zi\hi(Z) = b) < 0. 
Obviously, the last inequality is satisfied if we can show 

9 i E e (Z i \h i (Z) = b)>0. 
We may further condition on Zj = Zj for j ^ i and it suffices to establish 

(8) 9 l E e (Z i \h l (Z) = b, Zj = Zj ,j + i) > 0. 

Given that Zj = zj, j ^ i, consider only the case where hi(Z) = b has so- 
lutions. Due to symmetry of hi{Z), these solutions are in pairs. Let iy*., 
k £ K, denote the solutions. Hence the left-hand side of (8) equals 

9 i E e (Z i \Z i = ±y k ,kEK) 

= J2 0iE e (Z z \Zi = ±y k )P e (Zi = ±y k \Z l = ±y k , k G K). 
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Note that 

which is symmetric in 9iyk and is increasing for QiUk > 0. Hence (9) is 
bounded below by zero, a bound obtained by substituting Oiy^ = in (9). 
Consequently we establish that (7) is nonpositive, implying that 9 + in as 
good as 9. 

The strict inequality of the theorem can be established by noting that the 
right-hand side of (7) is bounded above by Eq[(9^~) 2 — 9 2 ] which by (6) is 
strictly negative. □ 



Theorem 4 implies the following corollary. 



Corollary 5. Under the assumptions on a and (5 in Theorem 3, 9 + 
with the ith component 

(10) 9+ = {l-aD- 1 \Z l f- 2 ) + Z l 

strictly dominates Z . 



It is interesting to note that the estimator (10), for /3 < 2, does give 
zero as the estimator when the \Zi\ are small. When applied to wavelet 
analysis, it truncates the small wavelet coefficients and shrinks the large 
wavelet coefficients. The estimator lies in a data-chosen reduced model. 

Moreover, for (5 = 2, Theorem 3 reduces to the classical result of Stein 
(1981) and (10) to the positive part James-Stein estimator. The upper 
bound of a for domination stated in Theorem 3 works only if /3 > 1 and 
d > /3/(f3 — 1). We know that for (3 < |, 9 fails to dominate Z because of the 
calculations leading to (11) below. We are unable to prove that 9 dominates 
Z for ^ < < 1- However, for such /3's, 9 has a smaller Bayes risk than Z if 
the condition (11) below is satisfied. 

A remark about an explicit formula for a. In wavelet analysis, a vast 
majority of the wavelet coefficients of a reasonably smooth function are 
zero. Consequently, it seems good to choose an estimator that shrinks a 
lot and hence using a larger than the upper bound in Theorem 3 is desir- 
able. Although Theorem 2 provides the largest possible a for domination in 
the frequentist sense, the bound is difficult to evaluate in computation and 
hence difficult to use in a real application. Hence we took an alternative ap- 
proach by assuming that 9i are independently identically distributed (i.i.d.) 
iV(0,T 2 ). It can be shown by a detailed calculation [see Zhou and Hwang 
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(2003)] that the estimator (3) and (4) has a smaller Bayes risk than Z for 
all t 2 if and only if 




(11) 0<a< a/3 = 2/E 

.i=l \i=l 

where are i.i.d. standard normal random variables. 

What is the value of apt It is easy to numerically calculate the bound an 
by simulating £j, which we did for a up to 100. It is shown that ap, = 
is at least as big as (5/3) (d — 2). Using Berger's (1976) tail minimaxity 
argument, we come to the conclusion that 6 + , with the ith component 

would possibly dominate Z. For various d's including d = 50 this was shown 
to be true numerically. 

To derive a general formula for ap for all f3, we then establish that the 
limit of ap/d as d— > oo equals, for 1/2 < /3 < 2, 

(13) Cp = 4[T((/3 + l)/2)] 2 /[^r((2/? _ i)/ 2 )]. 

It may be tempting to use (d — 2)Cp. However, we recommend 

(14) o=(0.97)(d-2)C /J , 

so that at (3 = 4/3, (14) becomes (5/3)(d — 2). Berger's tail minimaxity 
argument and many numerical studies indicate that this a enables (10) to 
have a better risk than Z. 

3. Approximate Bayesian justification. It would seem interesting to jus- 
tify the proposed estimation from a Bayesian point of view. To do so, we 
consider a prior of the form 

i/(\\o\bf c , \\0\\f»l, 

where \\6\\p = (£ \\9i\f) 1/l3 , and c is a positive constant which can be speci- 
fied to match the constant a in (10). In general the Bayes estimator is given 
by 

Z + Vlogm(Z), 

where m{Z) is the marginal probability density function of Z, namely, 

, , -(1/2)112-011* 

" !(z> = /"7^7!^* (9)de ' 



1,(6) 



The following approximation follows from Brown (1971), which asserts that 
Vlogm(Z) can be approximated by Vlog7r(Z). The proof is given in the 
Appendix. 
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Theorem 6. With tt(9) and m(X) given above, 

lim V '* 0gm , (Z >=l. 

\Zi\-y+oo V;log7r(Z) 

Hence by Theorem 6, the ith component of the Bayes estimator equals 
approximately 

Zi + V» log tt(Z) = Zi- - 



E\Zi\ P 

This is similar to the untruncated version of 9 in (2) and (3). 

4. Empirical Bayes justification. Based on several signals and images, 
Mallat (1989) proposed a prior for the wavelet coefficients 9i as the expo- 
nential power distribution with the probability density function (p.d.f.) of 
the form 

(15) f(9i) = ke-^, 
where a and (3 < 2 are positive constants and 

k = l3/{2aT{l/l3)) 

is the normalization constant. See also Vidakovic [(1999), page 194]. Using 
the method of moments, Mallat estimated the values of a and (3 to be 1.39 
and 1.14 for a particular graph. However, a and (3 are typically unknown. 

It seems reasonable to derive an empirical Bayes estimator based on this 
class of prior distributions. First we assume that a is known. Then the Bayes 
estimator of 9i is 

d 

Zi + T^-logm(Z). 

Similar to the argument in Theorem 6 and noting that for (3 < 2, 

e -\e i +z i f/cfi /e -\9 t f/cfi^ 1 as 0.^^ 

the Bayes estimator can be approximated by 

(16) Zi + ^logvr(^) =Zi- ^\Zif- 1 fflgn(Zi). 

Note that, under the assumption that a is known, the above expression is 
also the asymptotic expression of the maximum likelihood estimator of 9{ 
by maximizing the joint p.d.f. of (Zi,9i). See Proposition 1 of Antoniadis, 
Leporini and Pesquet (2002) as well as (8.23) of Vidakovic (1999). In the 
latter reference, the sign of Zi of (16) is missing due to a minor typographical 
error. 
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Since a is unknown, it seems reasonable to replace a in (16) by an esti- 
mator. Assume that O^s are observable. Then by (15) the joint density of 
(6 1 ,...,6 d ) is 

' P 
_2aT(l/p) 

Differentiating this p.d.f. with respect to a gives the maximum likelihood 
estimator of a 13 as 

(17) (pmf)/d. 

However, Oi is unknown and hence the above expression can be further 
estimated by (16). For (3 < 2, the second term in (16) has a smaller order 
than the first when is large. Replacing Oi by the dominating first term 
Zi in (16) leads to an estimator of a 13 as {(3Y\Zi\P) / d. 
Substituting this into (16) gives 

which is exactly the estimator Oi in (2) and (3) with a = d. Hence we have 
succeeded in deriving Oi as an empirical Bayes estimator when Zi is large. 

5. Connection to the wavelet analysis and the numerical results. Wavelets 
have become a very important tool in many areas including mathematics, 
applied mathematics, statistics and signal processing. They are also applied 
to numerous other areas of science such as chemometrics and genetics. 

In statistics, wavelets have been applied to function estimation with amaz- 
ing results of being able to catch the sharp change of a function. Celebrated 
contributions by Donoho and Johnstone (1994, 1995) focus on developing 
thresholding techniques and asymptotic theories. In the 1994 paper, rela- 
tive to the oracle risk, their VisuShrink is shown to be asymptotically op- 
timal. Further in the 1995 paper, the expected squared error loss of their 
SureShrink is shown to achieves the global asymptotic minimax rate over 
Besov spaces. Cai (1999) improved on their result by establishing that Block 
James-Stein (BlockJS) thresholding achieves exactly the asymptotic global 
or local minimax rate over various classes of Besov spaces. 

Now specifically let Y = (Yi, . . . , Y n )' be samples of a function /, satisfying 

(18) Y i = f(t i ) + e ii 

where t% = (i — l)/n and Ei are i.i.d. iV(0,<7 2 ). Here a 2 is assumed to be 
known and is taken to be 1 without loss of generality. See a comment at the 
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end of the paper regarding the unknown a case. One wishes to choose an 
estimate / = (/(ii), . . . , f(t n )) so that its risk function 



(19) 



E \\f-f\f=Ej2m)-f(ti)Y 



is as small as possible. Many discrete wavelet transformations are orthogo- 
nal transformations. See Donoho and Johnstone (1995). Consequently, there 
exists an orthogonal matrix W such that the wavelet coefficients of Y and / 
are Z = WY and 9 = Wf. Obviously the components Z\ of Z are indepen- 
dent, having a normal distribution with mean Q{ and standard deviation 1. 
Hence previous sections apply and exhibit many good estimators of 6. Note 
that, by orthogonality of W, for any estimator 5(Z) of 6, its risk function is 
identical to W'5(Z) as an estimator of / = W'O. Hence the good estimators 
in previous sections can be inversely transformed to estimate / well. 

In all the applications to wavelets discussed in this paper, the estimators 
(including our proposed estimator) apply separately to the wavelet coeffi- 
cients of the same resolution. Hence in (12), for example, d is taken to be 
the number of coefficients of a resolution when applied to the resolution. In 
all the literature that we are aware of, this has been the well. 



Blocks 



Bumps 



Doppler 




500 1000 1500 



HeaviSine 



PiecePolynomial 



PieceRegular 




500 1000 1500 



Fig. 1. The curves represent the true curves f(t) in (18). 




123451234512345123451234512345 
Blocks Bumps Doppler HeavtSine PiecePolynomial PieceRegular 



Fig. 2. In each of the six cases corresponding to Blocks, Bumps, and so on, the eight 
curves plot the risk functions, from top to bottom, when n = 64, 128, . . . , 8192. For each 
curve (see, e.g., the top curve on the left), the circles "o" from left to right give, with respect 
to n, the relative risks of VisuShrink, Block James-Stein, SureShrink and the proposed 
methods (12) and (20). 

In addition to considering the estimator (12), which is a special case of 
(10) with p = 4/3, we also propose a modification (10) with an estimated 
(3. The estimator (5 for (3 is constructed by minimizing, for each resolution, 
the Stein unbiased risk estimator (SURE) for the risk of (10). The quantity 
SURE is basically the expression inside the expectation on the right-hand 
side of (A. 4) summing over i, 1 < i < d, except that a is replaced by an. 
[Note that D in (A. 4) depends on (3 as well.] The resultant estimator is 
denoted as 

(20) s = (10) with [3 replaced by 0. 

Figure 1 gives six true curves (made famous by Donoho and Johnstone) 
from which the data are generated. For these six cases, Figure 2 plots the 
ratios of the risks of the aforementioned estimators to n, the risk of Y. 
Since most relative risks are less than 1, this indicates that most estimators 
perform better than the raw data Y. Our estimators 9 + in (12) and in 
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VisuShrink (simulated risk = 2.41) 



SureShrink (simulated risk = 1.13) 





BlockJS (simulated risk = 1 .94) 



Proposed Method (12) (simulated risk = 0.86) 





Fig. 3. Solid lines represent the true curves, and dotted lines represent the curves cor- 
responding to various estimators. The simulated risk is based on 500 simulations. 



(20), however, are the ones that are consistently better than Y . Furthermore, 
our estimators 9 + and 9 s virtually dominate all the other estimators in risk. 
Generally, 9 performs better than 9 + virtually in all cases. 

As shown in Figure 2, the difference in risks between 9 + and 9 s is quite 
minor. Since 9 + is computationally less intensive, we focus on 9 + for the 
rest of the numerical studies. 

Picturewise, our estimator does slightly better than other estimators. See 
Figure 3 for an example. Note that the picture corresponding to 9 + distin- 
guishes most clearly the first and second bumps from the right. 

Based on asymptotic calculation, the next section also recommends a 
choice of a in (21). It would seem interesting to comment on its numerical 
performance. The difference between the a's defined in (14) and (22) is 
very small when 64 < n < 8192 and when (3 is estimated by minimizing 
SURE. Consequently, for such (5, the risk functions of the two estimators 
with different a's are very similar, with a difference virtually bounded by 
0.02. The finite sample estimator [where a is defined in (14)] has a smaller 
risk about 75% of the time. 
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Fig. 4. Proposed estimator (12) 



to reconstruct Figure 1. 



The James-Stein estimator produces very attractive risk functions, some- 
times as good as the proposed estimator (12). However, it does not seem to 
produce good graphs. Compare Figures 4 and 5. 

In the simulation studies we use the procedures MultiVisu and MultiHy- 
brid which are VisuShrink and SureShrink in WaveLab802. See http: / /www- 
stat. stanford.edu/~wavelab. We use Symmlet 8 to do wavelet transforma- 
tion. In Figure 2 signal-to-noise ratio (SNR) is taken to be 3. Results are sim- 
ilar for other SNRs. To include the block thresholding result of Cai (1999), 
we choose the lowest integer resolution level j > log 2 (log n) + 1. 

A comment about the case where er 2 is not known to be 1. When a is 
known and is not equal to 1, a simple transformation applied to the problem 
suggests that (10) be modified with a replaced by aa 2 . When a is unknown, 
one could then estimate a by a, the proposed estimator for a in Donoho and 
Johnstone [(1995), page 1218]. With this modification in (12) (and even with 
the SURE estimated (5), the resultant estimators are not minimax according 
to some numerical simulations. However, they still perform the best or nearly 
the best among all the estimators studied in Figure 2. 
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n = 1024 
SNA = 3 

Fig. 5. James-Stein positive part applied to reconstruct Figure 1. 



6. Asymptotic optimality. To study the asymptotic rate of a wavelet 
analysis estimator, it is customary to assume the model 

(21) Y i = f(t i ) + e i , i = l,...,n, 

where ti = (i — l)/n and are assumed to be i.i.d. N(0,1). The estimator 
/ for /(•) that can be proved asymptotically optimal applies estimator (10) 
with 

(22) a = d(2lnd) {2 -P )/2 m f3 , < < 2, 
and 



E\eif = 2^ 2 T((f3 + 2)/2)^ 



to the wavelet coefficients Z{ of each resolution with dimensionality d of 
the wavelet transformation of the Y^'s. After applying the estimator to each 
resolution one at a time to come up with the new wavelet coefficient esti- 
mators, one then uses the wavelet base function to obtain one function / in 
the usual way. 

To state the theorem, we use B^ q to denote the Besov space with smooth- 
ness a and shape parameters p and q. The definition of the Besov class 
Bp q (M) with respect to the wavelet coefficients is given in (A. 19). Now the 
asymptotic theorem is given below. 
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Theorem 7. Assume that the wavelet tp is t-regular, that is, ift has t 
vanishing moments and t continuous derivatives. Then there exists a con- 
stant C independent of n and f such that 

(23) sup E [ \f(t)-f(t)\ 2 dt<C(]nn) l -P/ 2 n~ 2a / ( - 2a+1 \ 

f€B« q (M) J0 
for all M > ; < a < r, q>l and p > max(/3, — , 1). 

The asymptotic optimality stated in (23) is as good as what has been es- 
tablished for hard and soft thresholding estimators in Donoho and Johnstone 
(1994), the Garrott method in Gao (1998) and Theorem 4 in Cai (1999) and 
the SCAD method in Antoniadis and Fan (2001). However, the real advan- 
tage of our estimator is in the finite sample risk as reported in Section 5. 
Also our estimators are constructed to be minimax and hence have finite 
risk functions uniformly smaller than the risk of Z. This estimator 6 A for 
(3 = 4/3, however, has a risk very similar to (12). See Section 5. 



APPENDIX 



Proof of Theorem 6. Assume that \Zi\ > 1. We have 

Vilogm(Z) = ir(Z) (d/dZjmjZ) 

\z™oo Vilogvr(Z) |Zi|™oom(Z) ' {d/dZ^Z) ' 



We shall prove only 



|Zi|^oo Tt(Z) 




since 



lim 



oc 



(d/dZj)m(Z) 
(8/dZiMZ) 



= 1 



can be similarly established. 
Now 






mi + m,2 



say. 
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Obviously, as \Zi\—> +00, m\ has an exponentially decreasing tail. Hence 

lim — \r = 0. 

By the change of variable 9 = Z + y, we have 

J J\\z+v\\ p >i (V2^) p \\Z + y\\f 

To prove the theorem, it suffices to show the above expression converges 
to 1. In doing so, we shall apply the dominated convergence theorem to show 
that we may pass to the limit inside the above integral. After passing to the 
limit, it is obvious that the integral becomes 1. 

The only argument left is to show that the dominated convergence theo- 
rem can be applied. To do so, we seek an upper bound F(y) for 

\\Z\\f/\\Z + y\\f when \\Z + y\\p>l. 



Now for \\Z + y\\n > 1, 



that is, 



\z\\f<C p (\\z + y\\f + \\y\\f), 



^ < ^ f 1 + J^L) ^ + llvllf) 



\ z + y\\p v \\ z + y\\p 
'p(i + IM|J 



Hence if we take C p (l + \\y\\g C ) as F(y), then 



/•••/ —L-^ e -(w\\v\r F (y )dy<+00 . 

J J\\Z+y\y>l (V27r)P 

Consequently, we may apply the dominated convergence theorem, which 
completes the proof. □ 

Proof of Theorem 7. Before relating to model (21), we shall work 
on the canonical form: 

Zi = 6i + crEi, i = 1,2, . . . ,d, 

where a > and the e,'s are independently identically distributed standard 
normal random errors. Here 9 = (9±, . . . ,9d) denotes the estimator in (10) 
with a defined in (22). For the rest of the paper C denotes a generic quantity 
independent of d and the unknown parameters. Hence the C's below are not 
necessarily identical. 

We shall first prove Lemma A.l below. Inequality (A.l) will be applied 
to the lower resolutions in the wavelet regression. The other two inequalities 
(A. 2) and (A. 3) are for higher resolutions. □ 
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Lemma A.l. For any < (3 <2, < S < 1, and some C > 0, indepen- 
dent of d and the QiS, we have 



(A.l) 

and 

(A.2) 



Y^Etfi-Oif <Ca 2 d(\ud) 



(2-/3)/2 



i=l 



E{6i - 9if < C(0? + a 2 d & - l {\nd) 

d 



1/2) 



i=i 



5 2 mpd. 



Here and below mp denotes the expectation of |£j| , denned right above 
the statement of Theorem 7. Furthermore, for any < (3 < 1, there exists 
C > such that 



(A.3) 



E^i-Oif <Ca 2 \nd. 



Proof of Lemma A.l. Without loss of generality we will prove the 
theorem for the case a = 1. By Stein's identity, 



E0i - e t ) 2 = E 
(A.4) 



\ + {Z 2 -2)I l 
a 2 \Z^- 2 

+ ■ 



I 7 1/9-2 I 7 12/3-2 

2a (/3-l)^L + 2a/3 l 



Z? 



L> 2 v < ' D h D 2 

Here Ii denotes the indicator function I(a\Zi\P~ 2 > D) and Zf = 1 — h- Con 
sequently 

(A.5) h = l \i\Zi\ 2 ' 13 <a/D 

and 

(A.6) Zf = l iia\Zif- 2 /D<l. 

From (A.4) and after some straightforward calculations, 



Ej2(6i-hf 



(A.7) 



d + E 



Li=l 



+ 



D 



\ZA? 



2(/3-l) + 2/3^-) If 
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Using this and the upper bounds in (A. 5) and (A. 6), we conclude that (A. 7) 
is bounded above by 



d + E 



* a\Zi\P | a\Zj\P | 2 p\Z 



i=i 



D 



D 



D 



+ 2\[3 -l\d<C (In d^-^^d, 



completing the proof of (A.l). 
To derive (A. 2), note that 

E(l + (Z? - 2)h) = o 2 + E(-Z 2 + 2)1! . 

This and (A. 4) imply that 

'a\Zi\P- 2 



E{9i - 6if =6l + E 
+ E 
+ E 



D 



Z 2 - Z 2 



-2(0-1 

2f3a 



\z,\v- 2 



D 

ZA?- 2 MAP 



+ 2 



If 



by 



(A. 



D D 

Using (A. 7), one can establish that the last expression is bounded above 



6 2 + E[(-2(J3 - 1) + 2)1!) + E2/3^If < O 2 * + m + mi 



<9f + 8El!. 
We shall show under the condition in (A. 2) that 
(A.9) EI! < C(\9i\ 2 + d 5 ~\\ogdy 1 ' 2 ). 

This and (A. 8) obviously establish (A. 2). To prove (A.9), we shall consider 
two cases: (i) < (3 < 1 and (ii) 1 < (3 < 2. For case (i) note that, for any 
5 > 0, EI! equals 

P(a\Zif~ 2 <D) = P(D > a\Zif- 2 , \Z { \ < (2\nd) 1 ' 2 /{l + 5)) 

+ P(D > a\Zif~ 2 , \Zi\ > {2\nd) 1 ' 2 /(l + *))■ 

Obviously the last expression is bounded above by 

(A.10) P{D> (1 + 5) 2 - 15 dm p ) + P(\Zi\ > (21nd) 1/2 /(l + 8)). 

Now the second term is bounded above by 

(A.ii) cde^ + id^Vh^y 1 ) 
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by a result in Donoho and Johnstone (1994). To find an upper bound for 
the first term in (A. 10), note that by a simple calculus argument 

due to < P < 1. Hence the first term of (A. 10) is bounded above by 

p(J:\eif>{l + 6)^<top-^\oA. 

Replacing J2 l^il 13 by the assumed upper bound in (A. 2), the last displayed 
expression is bounded above by 

(A.12) p(j2 dm rt0- + - ( 2 - z^ 2 ]) • 

Using the inequality 

(1 + J) 2 "' 3 > 1 + (2 - p)S, 

one concludes that the quantity inside the bracket is bounded below by 

l + (2-/3)(5-5 2 )>l. 

Hence the probability (A.12) decays exponentially fast. This and (A. 11) then 
establish (A.9) for < P < 1. 

To complete the proof for (A. 2), all we need to do is to prove (A.9) for 
case (ii), 1 < P < 2. 

Similarly to the argument for case (i), all we need to do is to show that 
the first term in (A. 10) is bounded by (A. 11). Now applying the triangle 
inequality, 

D^<{T,\^f) 1/P + (T,\^ 1/p 



to the first term of (A. 10) and using some straightforward algebraic manip- 
ulation, we obtain 

P(D > (1 + 5f- p dmp) 

(A.13) 

>(2-gV/? 2 ~P X 2/0\ 



< p [ g > ^[{(i + sf-w - 



Note that 



y 1 ~ 2/3 

and consequently the quantity inside the brackets is bounded below by 

1 + ^—^{S - <5 2//3 )l >l + (2-p)(5- 5 2 ' p )/2 > 1. 
2p 
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Now this shows that the probability on the right-hand side decreases ex- 
ponentially fast. Hence inequality (A. 9) is established for case (ii) and the 
proof for (A. 2) is now completed. 

To prove (A. 3) for < j3 < 1, we may rewrite (A. 4) as 



(A.14) 



F; 



2(1 -P)E 



D 



-If 



Inequality (A. 3), sharper than (A.l), can be possibly established due to the 
critical assumption j3 < 1, which implies that 



(A.15) 



IZA 2 ^ 2 < [ - 



-(2-2/3)/(2-/3) 



if If = 1. 



Note that the last term in (A.14) is obviously bounded above by 2(1 — (5). 
Furthermore, replace |Zj| 2 ^ -2 in the third term on the right-hand side of 
(A.14) by the upper bound in (A.15) and replace Z 2 in the second term by 
the upper bound 



\Z;\ 2 < 



d) 



2/(2-/3) 



when ij = 1, 

which follows easily for (A. 5). We then obtain an upper bound for (A.14): 

a X 2/(2-/3) 



l + E 



+ E 



DJ 

< (3 - 2(5) + CE 



d) W +2 d*) 



+ 2(1-/3) 



DJ 



2/(2-/3) 



Here in the last inequality 2[3a/D 2 was replaced by 2f3a 2 /D 2 . To establish 
(A. 3), obviously the only thing left to do is 



(A.16) 



E 



a 

DJ 



\ 2/(2-/3) 



<C\n(d). 



This inequality can be established if we can show that 

d\ 2/(2-/3) 



(A.17) 



E 



DJ 



since the definition of a and a simple calculation show that 

2 /(2-/3) =C7o 2/(2-/3) ln(d)i 
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To prove (A. 17), we apply Anderson's theorem [Anderson (1955)] which 
implies that \Zi\ is stochastically larger than \ei\. Hence 



d \ 2/(2-/3) 



<E 



I/ 3 



2/(2-/3) 



which is bounded by A + B. Here 



A = E 



2/(2-/3) 



/ ^Sif <dmp/2 



i=l 



and 



B = E 



d/(ENI 



2/(2-/3) 



i=l 



and as before /(•) denotes the indicator function. 
Now £> is obviously bounded above by 

< C. 

Also by the Cauchy-Schwarz inequality 



A 2 < E 



<*/(Ei 



4/(2-/3) 



i=l 



Here the last inequality holds since the probability decays exponentially fast. 
This completes the proof for (A. 17) and consequently for (A. 3). □ 



Now we apply Lemma A.l to the wavelet regression. We only prove the 
case < [3 < 2. For [3 = the proof is similar and simpler. Equivalently we 
shall consider the model 

(A.18) Z jk = 9 jk + e jk /V^, k = l,...,2 j , 

where the 6j k s are wavelet coefficients of the function /, and the £jkS 
are i.i.d. standard normal random variables. For the details of reasoning 
supporting the above statement, see, for example, Section 9.2 of Cai (1999), 
following the ideas of Donoho and Johnstone (1994, 1995). Also assume that 
O's live in the Besov space Bp q (M) with smoothness a and shape parameters 
p and q, that is, 

/ \ g/p 

(A.19) Y, 2i9ia+1/2 ~ 1/p) (Y,\ 9 jk\ P ) <M q 

j \ k J 
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for some positive constants a, p, q and M. The estimator 9 below for model 
(A. 18) refers to (20) with a defined in (22) and a 2 = 1/n. For such a 9, the 
total risk can be decomposed into the sum of the following three quantities: 

Ri = X X E{Qjk — &jk) 2 , 

j<jo k 

R2 = X X - 6jk) 2 , 

J>j>jo k 

R3 = X ~~ Ojk) 2 , 

j>J k 

where j = [log 2 (C (5 n 1 /( 2 «+ 1 ))] and C s is a positive constant to be specified 
later. Applying (A.l) to Ri, which corresponds to the risk of low resolution, 
we establish by some simple calculation 

(A.20) R x < C(lnn)^^ 2 n~ 2a ^ 2a+l l 

For j > jo, (A. 19) implies 

(A.21) X \ e 3k\ P < M p 2- jp( - a+1/2 ~ 1/p) = M p 2 j 2~ jp( - a+1/2) . 

k 

Furthermore, for p> (3 

2 -jp(a+l/2) < 2 -j/3(a+l/2) < 2 -j 0/ 3(a+l/2) = N-/3(a+l/2) ^ 

Choose Cs > such that 

-(^)'(^t)V 

This then implies that 

161., IP < ¥1 oi J3 

satisfying the condition in (A. 2) for d = 2 J and 5 = (2a + l) -1 . 
Now for p > 2 we give an upper bound for the total risk. 
From (A. 2) we obtain 

R2 + R 3 <CJ2J2 d h + o(n~ 2 «l^ ) 
j>jo k 

and from Holder's inequality the first term is bounded above by 

j>j \ k / 



< 



24 H. H. ZHOU AND J. T. G. HWANG 



Then inequality (A. 21) gives 

R 2 + Rs < C J2 2^ 1 - 2 M2-^ a+1 / 2 ~ 1 M + (n- 2a ^ 2a+ V) 
j>jo 

= CY,^ j2a + o(n- 2a/ ^ 2a+l) ) 
j>jo 

<Cn~ 2a ^ 2a+1 \ 

This and (A.20) imply (23) for < < 2 and p > 2. 

Note that for (3 = 2 the proof can be found in Donoho and Johnstone 
(1995). For (3^2 our proof is very different and much more involved. 

To complete the proof of the theorem, we now focus on the case < (3 < 2 
and 2 > p > max{l/a,/3} and establish (23). We similarly decompose the 
risk of 6 as the sum of R\, R2 and R3. Note that the bound for R\ in (A.20) 
is still valid. Inequalities (A. 2) and (A. 3) imply 

«*< E EftA^*.' 1 



J>3>]Q k 



for some constants C > 0. Furthermore, the inequality 

J^Xi AA< A^J^xl, Xi>0,A>0,l>t>0, 

implies 



E E^A^f E Ei 



J>j>jo k 71 n ' J>j>jo k 



hk?- 



Some simple calculations, using (A. 21), establish 
(A.22) 



logrA 1 p / 2 



n 



R2 <c(^) Y, 2-^ a+1 / 2 - 1 / p ) + (n- 2Q /( 2Q+1 ); 

J>j>30 



KCilogn) 1 -^- 2 ^ 2 ^. 
From Holder's inequality, it can be seen that i?3 is bounded above by 

2/p 



j>j \ k J 



Similarly to (A.22), we obtain the upper bound of R3, 

R 3 <CJ2 2-^ a+1 / 2 ^M = ( n - 2Q /( 2a+1 )), 

3>J 

where J is taken to be log 2 n. Thus for < (3 < 2 and 2 > p > max{l/a, /?}, 
we have 

sup E\\0 -0\\ 2 KCilogn) 1 ^ 2 ^ 2 */^^. 
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